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Abstract 

Background: Chromatin immunoprecipitation (ChIP) followed by high-throughput sequencing (ChlP-seq) or ChIP 
followed by genome tiling array analysis (ChlP-chip) have become standard technologies for genome-wide 
identification of DNA-binding protein target sites. A number of algorithms have been developed in parallel that 
allow identification of binding sites from ChlP-seq or ChlP-chip datasets and subsequent visualization in the 
University of California Santa Cruz (UCSC) Genome Browser as custom annotation tracks. However, summarizing 
these tracks can be a daunting task, particularly if there are a large number of binding sites or the binding sites are 
distributed widely across the genome. 

Results: We have developed ChlPpeakAnno as a Bioconductor package within the statistical programming 
environment R to facilitate batch annotation of enriched peaks identified from ChlP-seq, ChlP-chip, cap analysis of 
gene expression (CAGE) or any experiments resulting in a large number of enriched genomic regions. The binding 
sites annotated with ChlPpeakAnno can be viewed easily as a table, a pie chart or plotted in histogram form, 
i.e., the distribution of distances to the nearest genes for each set of peaks. In addition, we have implemented 
functionalities for determining the significance of overlap between replicates or binding sites among transcription 
factors within a complex, and for drawing Venn diagrams to visualize the extent of the overlap between replicates. 
Furthermore, the package includes functionalities to retrieve sequences flanking putative binding sites for PCR 
amplification, cloning, or motif discovery, and to identify Gene Ontology (GO) terms associated with adjacent 
genes. 

Conclusions: ChlPpeakAnno enables batch annotation of the binding sites identified from ChlP-seq, ChlP-chip, 
CAGE or any technology that results in a large number of enriched genomic regions within the statistical 
programming environment R. Allowing users to pass their own annotation data such as a different Chromatin 
immunoprecipitation (ChIP) preparation and a dataset from literature, or existing annotation packages, such as 
GenomicFeatures and BSgenome, provides flexibility. Tight integration to the biomaRt package enables up-to-date 
annotation retrieval from the BioMart database. 



Background 

ChIP followed by high-throughput sequencing (ChlP- 
seq) and ChIP followed by genome tiling array analysis 
(ChlP-chip) have become standard high-throughput 
technologies for genome-wide identification of DNA- 
binding protein target sites [1-4]. A number of algo- 
rithms and tools have been developed for analyzing the 
large datasets generated by ChlP-chip (reviewed in [4]) 
and ChlP-seq experiments [1,5-10]. The output from 
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such algorithms is typically presented as a list of binding 
sites (also referred to as peaks) that are significantly 
enriched in the ChIP sample compared to the control 
sample(s). The identified binding sites are usually con- 
verted to a format, such as BED or Wiggle (WIG), that 
can be uploaded to the UCSC Genome Browser, an 
open-access, web-based, up-to-date source for genome 
sequence data integrated with a large collection of 
related annotations [11]. This resource allows the user 
to build a custom annotation track to view the proxi- 
mity of the DNA-binding sites to various genomic fea- 
tures such as genes, exons, transcription start sites and 
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conserved elements. However, searching the UCSC Gen- 
ome Browser can be a daunting task for the user, parti- 
cularly if there are a large number of binding sites or 
the binding sites are distributed widely across the 
genome. 

Several useful web applications have been developed 
for managing and annotating ChlP-chip data [12-14] and 
ChlP-seq data [14]. However, there is a need for technol- 
ogy platform-independent and genome-independent 
batch annotation tools. Here we describe a Bioconductor 
package called ChlPpeakAnno that facilitates batch anno- 
tation, using a variety of annotation sources, of binding 
sites identified by any technologies which result in large 
number of enriched genomic regions, such as ChIP- chip, 
ChlP-seq and CAGE. ChlPpeakAnno leverages the statis- 
tical environment R/Bioconductor with various sources 
of annotations, such as Ensembl, the UCSC genome data- 
base and others. In addition, users have the flexibility to 
label enriched regions with any annotation of interest 
such as a dataset from the literature. This package is 
available from Bioconductor, an open source and open 
development software project specializing in biological 
data analysis and integration based on R, a system for 
statistical computation and graphics [15,16]. Bioconduc- 
tor tools are distributed as separate but interoperable 
packages, each specializing in different areas of biological 
data analysis, such as the limma package for analyzing 
microarray data [17] and the biomaRt package for 
retrieving genomic annotation from the federated query 
system BioMart Ensembl [11,18,19]. The ChlPpeakAnno 
package contains various functionalities to batch-annotate 
enriched regions identified from ChlP-seq, ChlP-chip or 
CAGE experiments. 

ChlPpeakAnno emphasizes flexibility, integration and 
ease of use. Users are supplied with functionalities for 
annotating peaks from ChlP-seq, ChlP-chip, CAGE or 
any experiment resulting in a list of chromosome coor- 
dinates with any annotation track users are interested 
in. Even though some of the functionalities such as the 
retrieval of neighbouring sequences for a set of peaks 
are available in other software programs, the complete 
set of tools and the flexibility offered by ChlPpeakAnno 
are not available in any other software. The main differ- 
entiating point from CEAS, CisGenome and other soft- 
ware is that ChlPpeakAnno enables comparison between 
a set of peaks with any annotation feature objects, for 
example comparing to CpG islands, to conserved ele- 
ments (or other annotated features not captured by 
CEAS http://ceas.cbi.pku.edu.cn/submit.htm or CisGen- 
ome http://www.biostat.jhsph.edu/~hji/cisgenome/index. 
htm) (survey results) or comparing two sets of peaks 
between replicates (personal communication with Ivan 
Gregoretti at NIH) or transcription factors within a 
complex (unpublished data). In addition, unlike 



ChlPpeakAnno, CEAS or CisGenome does not have 
overlapping significance testing or Gene Ontology (GO) 
enrichment testing implemented. GO is a system for 
describing the molecular function, biological process 
and cellular compartmentalization of gene products 
[20]. Another main advantage of ChlPpeakAnno is the 
ability/flexibility to plug in with other annotation 
packages, such as biomaRt [17] and GO.db, ChlP-chip 
analysis packages such as Ringo [21] and ACME [22], 
other fast moving deep-sequencing analysis capabilities 
and infrastructure (Table 1) such as ShortRead [23], 
DEGseq [24], edgeR [25], BayesPeak [26], chipseq, ChlP- 
seqR, Rolexa [27], BSgenome, IRanges, Biostrings, rtrack- 
layer [28], GenomeGraphs [29] and statistical analysis 
tools such as multtest and limma [17] in Bioconductor 
(survey results). 

Usability is the top priority for ChlPpeakAnno. Once 
the package is loaded, one line of code (annotatePeakln- 
Batch) enables users to find nearest or overlapping fea- 
tures such as gene, exon, miRNA, 5' utr, 3' utr, peaks 
from another dataset or any annotation track of interest. 
Users are also provided with the flexibility and function- 
ality to get the annotation on the fly (get Annotation) . 
Two lines of code (getEnrichedGO) allow users to find 
enriched gene ontology terms. One line of code (make- 
VennDiagram) allows users to draw a Venn diagram 
and provide a p-value for determining the significance 
of the overlapping between datasets. Repeated calling of 
function findOverlappingPeaks enables users to find the 
overlapping among peaks from several different experi- 
ments, which will help users to determine how peaks 
from different replicates overlap and how peaks from 
different transcription factors within a complex overlap. 

Implementation 

ChlPpeakAnno implements a common annotation work- 
flow for ChlP-seq or ChlP-chip data in R, a system for 
statistical computation and graphics [15,16]. To promote 
component reuse and compatibility among Bioconduc- 
tor packages, ChlPpeakAnno utilizes the IRanges pack- 
age and represents the peak list as RangedData to 
efficiently find the nearest or overlapping gene, exon, 5' 
utr, 3' utr, microRNA (miRNA) or other custom feature 
(s) supplied by the user, such as the most conserved 
non-coding element, CpG island or transcription factor 
binding sites. All peak-calling software produces a file 
containing at least a list of chromosome coordinates 
that is all ChlPpeakAnno package needs. Both BED 
http:/ /genome. ucsc. edu/FAQ/FAQformat#formatl and 
GFF (General Feature Format, http://genome.ucsc.edu/ 
FAQ/FAQformat#format3) are common file formats 
that provide a flexible way to define peaks or annota- 
tions as data lines. Therefore, conversion functions 
BED2RangedData and GFF2RangedData were 
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Table 1 An overview of Bioconductor packages for analyzing high-throughput sequencing data. 



Package 


Classification 


Functionalities 


ShortRead 


Input/Output 
QA 

Filtering 


Supplies methods for reading, quality assessment (QA) and basic manipulation of high-throughput 
sequencing data. 


Rolexa 


Base Calling 
QA 


Supports probabilistic base calling, quality checks and diagnostic plots for Solexa sequencing data. 


IRanges 


Infrastructure 
Ranged-based algorithm 


Provides infrastructure for representing and manipulating sets of integer ranges, and implements 
algorithms for range-based calculations such as intersect, union, disjoint, overlap and coverage. 


BSgenome 


Whole Genome 
Annotation Data 


Supplies infrastructure for efficiently representing, accessing and analyzing whole genome. 


Biostrings 


String manipulation 


implements functions for pattern matching, sequence alignment and string manipulation 


1 i! ULKIUycl 


\/ici i ^ 1 \~7^t\t^ ri 
v IbUdllZdUUl 1 


rioviues an iiieiidce oeiween n anu genome uiowseis anu implements luiicuuns lu impoiL, cieaie, 
export, and display track data by linking R with existing genome browsers. 


Genom^Grophs 




ntegrates Ensembl annotation obtained using the biomaRt package and the grid graphic package to 
facilitate visualization, plotting and analysis of a diverse genomic datasets. 


ChlPpeokAnno 


Annot3tion 
Plotting 
Overlap test 
Enrichment test 


implements a common annotation workflow for ChlP-seq data such as finding nearest or overlapping 
features and obtaining enriched GO terms. In addition, it contains functions for determining the 
significance of the overlap and visualizing the overlap as a Venn diagram among different datasets. 


Genominator 


Annotation 
Summarization 


Offers an interface for storing and retrieving genomic data in SQLite database. 


ChlPsim 


Simulation of ChlP-seq 
experiments 


Provides a framework for the simulation of ChlP-seq experiments such as nucleosome positioning and 
transcription factor binding sites. 


chipseq* 


Analysis of ChlP-seq data 


implements basic workflow for analyzing ChlP-seq experiments, including functions to extend reads, 
calculating genomic coverage, and identifying peaks. 


Cj/in 




l.oi ili luuies meLiiuus lo numiaiize me count uaid diiu ueiecL piuieiu uounu genomic legions wmi 
controlled false discovery rate through random permutation. Models the sequence counts as poison 
distribution. 


BayesPeak* 




Identifies peaks using hidden Markov models and Bayesian statistical methodology. Models the sequence 
counts as the negative binomial distribution. 


ChlPsegR 


Analysis of nucleosome 
ChlP-seq data 


Furnishes functions to analyze nucleosome ChlP-seq data and may be adapted to handle other types of 
ChlP-seq experiments. 


edgeR 


Analysis of RNA-seq data 


Provides statistical routines for determining differential expression in count-based expression data such as 
RNA-seq, SAGE and CAGE. The RNA-seq data are modelled as the negative binomial distribution and 
applied with empirical Bays procedure. 


DEGseq 




implements functions for identifying differentially expressed genes from RNA-seq data by modelling the 
RNA-seq data as the binomial distribution. 


baySeq 




Contains methods to determine differential expression in count based expression data with more 
complex experimental designs using Bayesian methods. 


DESeq* 




Provides functions for identifying differentially expressed genes from RNA-seq data by modelling the 
RNA-seq data as the negative binomial distribution. 


goseq* 


Enrichment testing of 
RNA-seq data 


GO enrichment testing for RNA-seq data. 



•Available in BioC 2.6 in R 2.11.0. 



implemented for converting these data formats to a 
RangedData object. Since the genome annotations are 
updated periodically/ frequently, we have leveraged the 
biomaRt package from Bioconductor to enable retrieval 
of annotation data on the fly from Ensembl. For fast 
access, transcription start sites from common genomes 
such as TSS. human. NCBI36, TSS.human.GRCh37, Exon- 
PlusUtr.human.GRCh37, TSS.mouse.NCBIM37, TSS.rat. 
RGSC3.4 and TSS.zebrafish.Zv8 were included as pre- 
built annotation data packages. Users also have the flex- 
ibility to pass annotation data from the GenomicFeatures 



package as well as their own annotation data, such as a 
list of binding sites from other transcription factors, a 
different ChIP preparation or a different peak-calling 
algorithm. We have also utilized the BSgenome package 
to implement functions that allow retrieval of flanking 
sequences associated with peaks of interest. This facili- 
tates subsequent PCR amplification, cloning and/or 
motif discovery using algorithms such as MEME [3,30]. 
To ascertain whether the identified peaks are enriched 
around genes with certain GO terms, we have imple- 
mented a GO enrichment test. This test applies the 
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hypergeometric test phyper in R and integrates with GO 
annotation from the GO.db package, species-specific GO 
annotation packages such as org.Hs.eg.db and multipli- 
city adjustment functions from the multtest package in 
Bioconductor. GO annotation packages are updated per 
release that corresponds to twice a year. Binding sites 
annotated with ChlPpeakAnno can be exported as an 
Excel file to allow easy sorting and statistical analysis of 
larger lists of peaks. Alternatively, the distribution of 
peaks relative to genomic features of interest (e.g., tran- 
scription start site or exon start site) can be easily 
plotted for viewing as pie chart or histograms. In addi- 
tion, we have implemented functionalities using hyper- 
geometric test for determining the significance of 
overlap between replicate experiments, different peak- 
calling algorithms or binding sites among transcription 
factors within a complex, and for drawing Venn dia- 
grams to visualize the extent of the overlap between 
replicates. 

Results 

Example 1: Finding the nearest gene and the distance to 
the transcription start site of the nearest gene 

The output from ChlP-seq or ChlP-chip analysis is a list 
of binding sites (as chromosome locations) that are sig- 
nificantly enriched in the ChIP sample(s) compared with 
the corresponding control sample(s). The example 
below details how to find the nearest gene and the dis- 
tance to the transcription start site (TSS) of the nearest 
gene in the human genome for a list of binding sites 
(named myPeakList) of type RangedData. The distance 
is calculated as the distance between the start of the 
binding site and the TSS, which is the gene start for 
genes located on the forward strand and the gene end 
for genes located on the reverse strand. 

The first step is to load the ChlPpeakAnno package, 
an example dataset and an annotation dataset. In this 
example, the example dataset contains putative STAT1- 
binding regions identified in un-stimulated cells [2], and 
the annotation dataset contains the TSS coordinates and 
strand information from human GRCh37. 

>library( ChlPpeakAnno) 

>data(myPeakList) 

>data(TSS.human.GRCh37) 

In the next step, the function a nnotatePeaklnBatch is 
called to find the gene with nearest TSS or overlapping 
gene that is not the nearest TSS and corresponding dis- 
tance for the list of binding regions. Sometimes, a peak is 
within a gene but far from the gene's TSS. Setting the 
parameter output to "both" outputs both the genes with 
nearest TSS and the overlapping gene. The parameter 
maxgap sets the maximum gap to be considered as over- 
lapping. The parameter multiple indicates whether multi- 
ple overlapping features should be returned for one peak. 



>annotatedPeak = annotatePeaklnBatch (myPeakList, 
AnnotationData = TSS.human.GRCh37, output="both", 
multiple = F, maxgap = 0) 

The annotated peaks can be saved as an Excel file for 
biologists to view easily. 

>write.table(as.data.frame(annotatedPeak), file = 
"annotatedPeakList.xls", sep = "\t", row.names - FALSE) 

Plotting the distribution of the peaks relative to the 
TSS gives a birds-eye view of the peak distribution rela- 
tive to the genomic features of interest. 

>y = annotatedPeak$distancetoFeature [!is.na(annota- 
tedPeak$distancetoFeature) &iannotatedPeak$fromOver- 
lappingOrNearest == "NearestStart"] 

>hist(y, xlab = "Distance To Nearest TSS", main = "", 
breaks = 1000, xlim = c(min(y)-100, max(y) + 100)) 

Such a plot generated from the putative STATl-bind- 
ing regions identified in un-stimulated cells ([2]) shows 
that the STATl-binding sites are enriched in regions 
near TSSs (Figure 1). A pie chart was also generated as 
follows to show the distribution of relative position of 
the peaks to the nearest gene (Figure 2). 

>temp = as.data.frame(annotatedPeak) 

>pie(table(temp [as.character(temp$fromOverlappin- 
gOrNearest) == "Overlapping" \ (as.character(temp$fro- 
mOverlappingOrNearest) == "NearestStart" & Hemp 
$peak %in% temp[as.character(temp$fromOverlappin- 
gOrNearest) == "Overlapping",]$peak),]$insideFeature)) 

It is also possible to obtain the annotation on-line 
from Ensembl using the getAnnotation function as fol- 
lows prior to calling annotatePeaklnBatch. Please refer 
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Figure 1 Distribution of STATl-binding sites relative to nearest 
TSSs in human. The histogram was generated from the putative 
STATl-binding regions within 20 kb around TSS identified in un- 
stimulated cells [2], The plot shows that STATl-binding sites are 
enriched in regions more symmetrically around transcription start 
sites. The mean distance from the nearest TSS is 8533391 ± 295725 
bases (mean ± SEM). 
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Figure 2 Pie chart of STAT1 -binding sites relative to nearest 
gene in human. The pie chart was generated from the putative 
ST AT1 -binding regions identified in un-stimulated cells [2]. The plot 
shows that STAT1 -binding sites are evenly distributed along 
upstream, downstream and inside genes. 



to the biomaRt package documentation for a list of 
available biomarts and datasets [18]. 

>mart = useMart(biomart="ensembl", dataset= 
"hsapiens_gene_ensembl") 

> Annotation = getAnnotation(mart, featureType="TSS") 

To annotate the peaks with other genomic features, it is 
necessary to change the featureType (e.g., "exon" to find 
the nearest exon, "miRNA" to find the nearest miRNA, 
"5utr" to find the nearest 5' utr, and "3utr" to find the 
nearest 3' utr). It is also possible to pass customized 
annotation data into the function annotatePeaklnBatch. 
For example, the user may have a list of transcription fac- 
tor binding sites from the literature, from a different bio- 
logical replicate, from a different peak-calling algorithm 
or from a different protein functioning as transcription 
complex together with the protein in study, and is inter- 
ested in determining the extent of the overlap to the list 
of peaks from his/her experiment. Prior to calling the 
function annotatePeaklnBatch, it is necessary to repre- 
sent both datasets as RangedData, where start is the start 
of the binding site, end is the end of the binding site, 
names is the name of the binding site, and space and 
strand are the chromosome name and strand, respec- 
tively, where the binding site is located. 

>myexp = RangedData(IRanges(start = c(967654, 

2010897, 2496704), end = c(9677S4, 2010997, 2496804), 
names = cf'Sitel", "Site2", "Site3")), space = c("l", "2", "3")) 

>literature = RangedData(IRanges(start = c(967659, 

2010898, 2496700, 3075866, 3123260), end = c(967869, 
2011108, 2496920, 3076166, 3123470), names = c("tl", 
"t2", "t3", "t4", "tS")), space = c("l", "2", "3", "1", "2"), 
strand = c(l, 1, -1,-1,1)) 

>annotatedPeakl = annotatePeakInBatch(myexp, 
AnnotationData = literature, output=' "overlapping", mul- 
tiple = F, maxgap = 0) 



Peaks in text format from peak-calling algorithms can 
be easily imported to R as data frame then converted to 
RangedData. For binding sites represented in BED or 
GFF format, BED2RangedData or GFF2RangedData 
were provided for converting these data formats to 
RangedData. 

Example 2: Determining the significance of the 
overlapping and visualizing the overlap as a Venn 
diagram among different datasets 

The second example describes how to determine the 
significance of the overlap, visualize the overlap in a 
Venn diagram and obtain merged peaks from different 
datasets such as different biological replicates, different 
peak-calling algorithms or different proteins functioning 
as a transcription complex. Here we give examples using 
different biological replicates. 

The first step is to load the ChlPpeakAnno package 
and three example datasets as RangedData that contains 
putative Stel2-binding regions identified in yeast from 
three biological replicates [31]. 

>library( ChlPpeakAnno) 

>data(Peaks.Stel2.Replicatel ) 

> da ta(Peaks. Stel2. Replica te2) 

>data(Peaks.Stel2.Replicate3) 

In the next step, the function makeVennDiagram is 
called to generate a Venn diagram to visualize the over- 
lap among the three replicates. In addition, pair-wise 
overlapping significance tests were performed with 
hypergeometric test and p-values were generated. The 
parameter NameOfPeaks indicates the names of the 
dataset for labeling the Venn diagram. The parameter 
maxgap indicates the maximum distance between two 
peak ranges for them to be considered overlapping. The 
parameter totalTest indicates how many potential peaks 
in total that is used in hypergeometric test. 

>makeVennDiagram(RangedDataList(Peaks.Stel2. 
Replicatel, Peaks. Stel2.Replicate2, Peaks.Stel2.RepU- 
cate3), NameOfPeaks = c('Replicatel","Replicate2",'Repli- 
cate3"), maxgap = 0, totalTest = 1580) 

As a result, a Venn diagram was generated for visua- 
lizing the overlap among the above three replicates. The 
pair-wise overlap comparisons show that the peaks iden- 
tified from the replicates overlap significantly (Figure 3, 
p-value < 0.0001). The same analysis was applied to the 
three biological replicates of Cse4 and the overlap 
between replicate 1 and 2 is significant at p-value < 0.01 
while the other two overlapping is significant at p-value 
< 0.05 (Figure 4). 

The peak ranges from replicates do not overlap per- 
fectly. It is desirable to combine all the overlapping peaks 
across replicates into merged peaks that cover all the 
overlapping peaks from the replicates. Calling the func- 
tion findOverlappingPeaks can generate the merged 
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Replicate3 589 



Figure 3 Overlapping of putative Ste12-binding sites among 
three biological replicates in yeast The Venn diagram was 
generated from the putative Ste1 2-binding sites of three biological 
replicates in yeast [31]. Hypergeometric test shows that there is a 
significant overlapping among the replicates (p-value < 0.000001 for 
all three pair-wise comparisons). 



Replicatel Replicate2 




Replicate3 34 



Figure 4 Overlapping of putative Cse4-binding sites among 
three biological replicates in yeast The Venn diagram was 
generated from the putative Cse4-binding sites of three biological 
replicates in yeast [31]. Hypergeometric test shows that the 
overlapping between biological replicate 1 and 2 is significant at 
p-value < 0.001 while the other two overlapping are significant at 
p-value < 0.05. 



peaks. Besides the parameters mentioned previously, an 
additional required parameter multiple indicates whether 
to return merged peaks from multiple overlapping peaks. 

>MergedPeaks = findOverlappingPeaks(findOverlap- 
pingPeaks(Peaks.Stel2.Replicatel, Peaks.Stel2.Replicate2, 
maxgap = 0, multiple = F, NameOfPeaksl = "Rl ", Name- 
OfPeaks2 = "R2")$MergedPeaks, Peaks.Stel2.Replicate3, 
maxgap = 0, multiple = F, NameOfPeaksl = "Replica- 
telRepliate2", NameOfPeaks2 = "R3")$MergedPeaks 

Next, nearest genes and distances between peak location 
and nearest TSS can be obtained by annotating the merged 
peaks with SGD1.01 using annotatePeaklnBatch function 
illustrated in example 1 (Figure 5 &6). The same analysis 
was performed with Cse4 binding-sites (Figure 7 &8). The 
un-equal variance t-test shows that the distribution of the 
distance of Stel2-binding sites to nearest TSSs (Figure 5, 
264 ± 36 bases) is very different from that of Cse4-binding 
sites (Figure 6, 311 ± 160 bases) (p-value = 0.001). Stel2- 
binding sites are distributed more towards the upstream 
of the gene (Figure 5 &6) while Cse4-binding sites are 
distributed more inside and downstream of the gene 
(Figure 7 &8). This result is consistent with what has been 
observed previously [31]. The annotated datasets are 
available in Additional file 1 and Additional file 2. 

Example 3: Obtaining the sequences around the binding 
sites for PCR amplification or motif discovery 

The third example describes how to obtain the 
sequences surrounding binding sites (in this example, 




I I I I I 

-2000 -1000 0 1000 2000 



Distance To Nearest TSS 
Figure 5 Distribution of Stel 2-binding sites relative to nearest 
TSSs in yeast. The histogram was generated from the putative 
Stel 2-binding regions merged from three biological replicates 
identified in yeast [31]. The plot shows that Stel 2-binding sites are 
enriched in regions upstream and around transcription start sites. 
The mean of the distance to the nearest TSS is -264 ± 36 bases 
(mean ± SEM). 
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Figure 6 Pie chart of Stel 2-binding sites relative to nearest 
gene in yeast. The pie chart was generated from the putative 
Stel 2-binding regions merged from three biological replicates 
identified in yeast [31] The plot shows that Stel 2-binding sites are 
distributed more on upstream and around the genes. 
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Figure 8 Pie chart of Cse4-binding sites relative to nearest 
gene in yeast. The pie chart was generated from the putative 
Cse4-binding regions merged from three biological replicates 
identified in yeast [31]. The plot shows that Cse4-binding sites are 
distributed more on inside and overlap the end of the genes. 



100 bp of upstream and downstream sequence) for PCR 
amplification, cloning or motif discovery [3,30]. 

The first step is to load the ChlPpeakAnno package and 
create an example peak dataset as RangedData. Next, the 
organism-specific BSgenome package is loaded followed 
by calling the function getAUPeakSequence. The function 
available.genomes shows a list of available organism- 
specific BSgenome data packages. In this example, E. coli 
data package is used due its light-weight. 




o 



I 1 1 1 1 1 

-2000 -1000 0 1000 2000 3000 

Distance To Nearest TSS 
Figure 7 Distribution of Cse4-binding sites relative to nearest 
TSSs in yeast. The histogram was generated from the putative 
Cse4-binding regions merged from three biological replicates 
identified in yeast [31]. The plot shows that Cse4-binding sites are 
enriched in regions inside and downstream of the transcription start 
sites. The mean of the distance to the nearest TSS is 31 1 ±160 
bases (mean ± SEM). 



>libmry( ChlPpeakAnno) 

ypeaks = RangedData(IRanges(start = c(100, 500), end = 
c(300, 600), names = c("peakl", "peak2")), space = c 
("NC_008253", "NC_010468")) 

>library(BSgenome.Ecoli.NCBI.20080805) 

ypeaksWithSequences = getAUPeakSequence(peaks, 
upstream = 100, downstream = 100, genome = Ecoli) 

To convert the sequences to a common FASTA file 
format, the following function is called. 

>write2FASTA(peaksWithSequences, file= "test.fa", 
width = 50) 

Sequences for merged Stel2 binding sites were 
obtained from package BSgenome.Scerevisiae.UCSC.sac- 
Cer2 (Additional file 3). Significant motifs (E-value < 
0.0000001) were identified by running MEME [30] with 
motif occurrence set as ZOOP, minimum width as 8, 
maximum width as 20 and other parameters as default 
(Figure 9). 




Figure 9 Motif of Stel 2-binding sites in yeast The motif was 
generated using MEME [30] and the sequences from the putative 
Stel 2-binding regions merged from three biological replicates 
identified in yeast [31]. The default parameters for MEME were 
chosen except that motif occurrence was set to ZOOP, minimum 
width to 8, and maximum width to 20. 
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Example 4: Obtaining enriched GO terms near the peaks 

The fourth example describes how to obtain a list of 
enriched GO terms associated with adjacent genes using 
a hypergeometric test. 

The first step is to load the TSS annotated peak, 
which is the result returned from calling the function 
annotatePeaklnBatch, and the organism-specific GO 
gene mapping package (e.g., org.Hs.eg.db for the GO 
gene mapping for human; for other organisms, please 
refer to http://www.bioconductor.org/packages/release/ 
data/annotation/ for additional org.xx.eg.db packages). 

>data(annotatedPeak) 

>library(o rg. Hs.eg.db) 

The next step is to call the function getEnrichedGO. 
The parameter maxP is the maximum p-value required 
to be considered to be significant, multiAdj indicates 
whether to apply multiple hypothesis testing adjustment, 
minGOterm is the minimum count in a genome for a 
GO term to be included, and multiAdj Method is the 
multiple testing procedure to be applied (for details, see 
mt.rawp2adjp in the multtest package). 

>enrichedGO < -getEnrichedGO (annotatedPeak [1:6,], 
orgAnn="org.Hs.eg.db", maxP = 0.01, multiAdj = TRUE, 
minGOterm = 10, multiAdjMethod="BH") 

Where enrichedGO$bp contains a list of enriched GO 
biological process, enrichedGO$mf contains a list of 
enriched GO molecular functions and enrichedGO$cc 
contains a list of enriched GO cellular components. 

Table 2 shows a list of enriched GO terms for yeast 
transcription factor Stel2 [31]. 

Conclusions 

ChlPpeakAnno enables batch annotation of binding sites 
identified from ChlP-seq, ChlP-chip, CAGE or any tech- 
nology that results in a large number of enriched geno- 
mic regions for any species with existing annotation 



data within the statistical programming environment R. 
Allowing users to pass their own annotation data such 
as different ChIP preparation and a dataset from litera- 
ture, or existing annotation packages, such as Genomic- 
Features and BSgenome, provides flexibility while 
the tight integration to the biomaRt package enables 
up-to-date annotation retrieval from the BioMart data- 
base. The main advantage of ChlPpeakAnno is the abil- 
ity/flexibility to plug in with other annotation packages, 
ChlP-chip analysis packages, other fast moving deep- 
sequencing analysis capabilities and infrastructure and 
statistical analysis tools in Bioconductor. Another advan- 
tage of ChlPpeakAnno is that it enables comparison 
between a set of peaks with any annotation feature 
objects, between two sets of peaks from replicate experi- 
ments or transcription factors within a complex and 
determination of the significance of the overlap. 

The ChlPpeakAnno package provides documentation 
in the form of an interactive manual illustrating the 
usage of individual functions as well as a vignette con- 
taining executable code snippets demonstrating a 
case-oriented help session. The vignette is run at 
package building and installation time, and thus also 
serves as a testing suite. Some of the examples 
described in the paper are also demonstrated in the 
vignette. 

Availability and requirement 

ChlPpeakAnno is an open source software package under 
the GNU General Public Licence v2.0 and has been con- 
tributed to the Bioconductor Project. The software, source 
code and documentation are available for download from 
http://www.bioconductor.org or installed from R by typing 
source http://bioconductor.Org/biocLite.R and biocLite 
("ChlPpeakAnno"). The package has been tested and 
run on OS X, Windows and various Linux systems. 



Table 2 Enriched GO terms of Ste12-binding sites in yeast. 



GO ID 


GO Term 


GO Definition 


Category 


FDR 


GO:0055114 


oxidation reduction 


The process of removal or addition of one or more electrons with or without the 
concomitant removal or addition of a proton or protons. 


BP 


0.018 


GO:0008270 


zinc ion binding 


Interacting selectively and non-covalently with zinc (Zn) ions. 


MF 


0.047 


GO:0043167 


ion binding 


Interacting selectively and non-covalently with ions, charged atoms or groups of atoms. 


MF 


0.046 


GO:0043169 


cation binding 


Interacting selectively and non-covalently with cations, charged atoms or groups of atoms 
with a net positive charge. 


MF 


0.046 


GO:0043565 


sequence-specific 
DNA binding 


Interacting selectively and non-covalently with DNA of a specific nucleotide composition, 
e.g. GC-rich DNA binding, or with a specific sequence motif or type of DNA e.g. promotor 
binding or rDNA binding. 


MF 


0.046 


GO:0046914 


transition metal ion 
binding 


Interacting selectively and non-covalently with a transition metal ions; a transition metal is 
an element whose atom has an incomplete d-subshell of extranuclear electrons, or which 


MF 


0.046 



gives rise to a cation or cations with an incomplete d-subshell. Transition metals often have 
more than one valency state. Biologically relevant transition metals include vanadium, 
manganese, iron, copper, cobalt, nickel, molybdenum and silver. 



The enriched GO terms were obtained from the putative Ste12-binding regions merged from three biological replicates identified in yeast [31]. The parameter 
used to generate the list is maxP = 0.05, multipAdj = TRUE, minGOterm = 5 and multiAdjMethod ="BH". 
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ChlPpeakAnno depends on R version 2.10. 0 or later and 
the following Bioconductor packages: biomaRt, multtest, 
IRanges, limma, Biostrings, BSgenome, and GO.db. In addi- 
tion, the lightweight organism-specific package BSgenome. 
EcolLNCBI. 20080805 and org.Hs.eg.db were installed dur- 
ing build time for testing the code snippets in the vignette. 
All these packages can be downloaded from Bioconductor 
or installed from R using the http://bioconductor.org/bio- 
cLite.R script. 
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